Abstract- When a knowledge base represents the experts' uncertainty, it is reasonable
to ask how far we are from complete knowledge, that is, how many more questions
we have to ask these experts in order to know what the real world is. An estimate of
the number of necessary questions is a natural measure of completeness for a given
knowledge base. We give such estimates for the Dempster-Shafer formalism. Namely, we show
that the average number of questions can be obtained by solving a simple mathematical
optimization problem. This average number of questions is not always sufficient to express
the fact that sometimes we have more knowledge. For example, it has the same value if we
have an event with two possible outcomes and nothing else is known, and if there is the
additional knowledge that the probability of every outcome is 0.5. We show that from the
practical viewpoint this is not a problem, because the difference between the necessary
numbers of questions in both cases is practically negligible.

Keywords: complexity of knowledge acquisition, Dempster-Shafer formalism. 


1. BRIEF INTRODUCTION TO THE PROBLEM. 
... of necessary questions. These estimates are a natural measure of completeness for a given
knowledge base.

Estimates of incompleteness are useful. Such estimates can be useful in several cases.
For example, suppose that we feel that our knowledge base needs updating, and we want
to estimate the cost of the update. The main part of updating is the acquisition of the new
knowledge from the experts. Since it is desirable to take the best (and therefore highly
paid) specialists as experts, the knowledge acquisition cost is an essential part of
the total update cost. We can estimate the cost $c$ of one question
by dividing the previous update cost by the number of questions asked; to get the
total acquisition cost, we multiply $c$ by the number of necessary questions.

Another situation where these estimates are applicable is when we choose between
the existing knowledge bases (for example, when we decide which of them to buy). When
choosing, we must take into consideration cost, performance time, etc. But the main
characteristic of a knowledge base is how much information it contains. We cannot
estimate this amount of information directly, but we can use the estimates of the number
of questions if they are available. Evidently, the fewer questions we need to ask in order
to obtain the complete knowledge, the more information was there initially. So the knowledge
base for which we have to ask the minimal number of questions is the one with the greatest
amount of information.

What we are planning to do. There exist several different formalisms for representing
uncertainty (see, e.g., Smets et al., 1988). In the present paper we estimate the necessary
number of questions for the Dempster-Shafer formalism. Namely, we show
that this average number of questions can be obtained by solving a simple mathematical
optimization problem.

It turns out that the same techniques can be applied to estimate the complexity of 
knowledge acquisition for the probabilistic approach to uncertainty (Nilsson, ). 

It seems desirable to have such a characteristic of uncertainty that if we add addi- 
tional information (i.e., diminish uncertainty), we decrease the value of this characteristic. 


Strictly speaking, our characteristic (the average number of binary questions) does not satisfy
this property. For example, it has the same value if we have an event with two possible
outcomes and nothing else is known, and if there is the additional knowledge that the
probability of every outcome is 0.5. We'll show that from the practical viewpoint this is not
a problem, because the difference between the necessary numbers of questions in both cases
is practically negligible.

The main results of this paper appeared first in (Chokr et al, 1991). 

The structure of the paper is as follows: there exists a well-known case where a formula
for the average number of questions is known: the case of probabilistic knowledge, which
was considered in the pioneering papers of Shannon on information theory. We are planning to
use the same methods that were used in its derivation. Since the derivation is not as
well known as Shannon's formula itself, we'll briefly describe it in Section 2. In Section 3 we
formulate a corresponding problem for the Dempster-Shafer formalism in mathematical terms
and present our results. In Section 4, we'll show that this characteristic is sometimes not
sufficient, but that from the practical viewpoint there is no need to worry. In Section 5 we apply
the same techniques to the case of a probabilistic knowledge. Proofs are in Section 6.

2. SHANNON’S FORMULA REVISITED 

First let's analyze the simplest possible case: formulation. Before we actually
analyze Shannon's formula, let us recall how to compute the complexity of knowledge
acquisition in the simplest case. Namely, we consider one event, and we know beforehand
that it can result in one of finitely many incompatible outcomes. Let's denote these
outcomes by $A_1, \ldots, A_n$ and their total number by $n$. For example, in the coin tossing case
$n$ equals two, and $A_1$ and $A_2$ are "heads" and "tails". If we are describing weather, then
it is natural to take "raining" as $A_1$, "snowing" as $A_2$, etc. How many binary questions
do we have to ask in order to find out which of the outcomes occurred?

The simplest case: result. The answer is well known: we must ask $Q$ questions, where
$Q$ is the smallest integer that is greater than or equal to $\log_2 n$. This number is sometimes
called the ceiling of $\log_2 n$ and is denoted by $\lceil \log_2 n \rceil$. And if we ask fewer than $Q$ questions,
we will be unable to always find the outcome.

Although the proof of this fact is well known (see, e.g., Horowitz and Sahni, 1984),
we repeat it here, because this result will be used as a basis for all other estimates.


The simplest case: proof. First we have to prove that $Q$ questions are sufficient.
Indeed, let's enumerate all the outcomes (in arbitrary order) by numbers from 0 to $n - 1$,
and write these numbers in binary form. Using binary numbers with $q$ digits, one gets
numbers from 0 to $2^q - 1$, that is, $2^q$ numbers in total. So one digit is sufficient for $n = 1, 2$;
two digits for $n = 1, 2, \ldots, 4$; $q$ digits for $n = 1, 2, \ldots, 2^q$; and in order to represent $n$ numbers
we need to take the minimal $q$ such that $2^q \ge n$. Since this inequality is equivalent to
$q \ge \log_2 n$, we need $Q$ digits to represent all these numbers. So we can ask the following
$Q$ questions: "is the first binary digit 0?", "is the second binary digit 0?", etc., up to "is
the $Q$-th digit 0?".

The fact that we cannot use fewer than $Q$ questions is also easy to prove. Indeed,
suppose we use $q < Q$ questions. After we ask $q$ binary questions, we get a sequence of
$q$ 0's and 1's ($q$ bits). If there is one bit, we have 2 possibilities: 0 or 1. We have $q$ bits,
so we have $2 \cdot 2 \cdot \ldots \cdot 2$ ($q$ times) $= 2^q$ possible sequences. This sequence is the only thing
that we use to distinguish outcomes, so if we need to distinguish between $n$ outcomes, we
need at least $n$ sequences. So the number of sequences $2^q$ must be greater than or equal
to $n$: $2^q \ge n$. Since the logarithm is a monotonic function, this inequality is equivalent to
$q \ge \log_2 n$. But $Q$ is by definition the smallest integer that is greater than or equal to
this logarithm, and $q$ is smaller than $Q$. Therefore $q$ cannot be $\ge \log_2 n$, and hence $q < Q$
questions are not sufficient.
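
The two halves of this argument can be illustrated with a short Python sketch. The function names are hypothetical (they do not come from the paper), and numbering the digits from the least significant one is an assumed convention. The sketch computes $Q$ as the smallest $q$ with $2^q \ge n$ and recovers an outcome number from $Q$ yes/no answers about its binary digits.

```python
def questions_needed(n: int) -> int:
    """Smallest integer Q with 2**Q >= n, i.e. the ceiling of log2(n)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # (n - 1).bit_length() is an exact integer form of ceil(log2(n)).
    return (n - 1).bit_length()


def identify_outcome(n: int, is_digit_zero) -> int:
    """Recover an outcome number in 0..n-1 from Q binary questions.

    is_digit_zero(d) answers the question "is binary digit d equal to 0?",
    where d = 0 is the least significant digit (an assumed convention).
    """
    outcome = 0
    for d in range(questions_needed(n)):
        if not is_digit_zero(d):
            outcome |= 1 << d
    return outcome


def make_oracle(secret: int):
    """Hypothetical 'expert' who knows the true outcome number."""
    return lambda d: (secret >> d) & 1 == 0
```

For instance, `identify_outcome(11, make_oracle(7))` asks `questions_needed(11)` questions and returns the hidden number 7.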

Situations that are covered by Shannon's formula. The above formula works fine
for the case when we have a single event, and we need to find what its outcome is. But in
many real-life cases the same types of events happen again and again: for example, we can toss
the coin again and again, we must predict weather every day, etc. In such cases there
is a potentially infinite sequence of repeating independent events. By the moment when
we are asking about the outcome of the current event, we normally already know what
outcomes happened before, which of them were more frequent, and which were rarer.

In some cases these frequencies change significantly in the course of time: for example, in
the case of global warming the frequencies of cold weather days will become smaller and
smaller. But in many cases we can safely assume that these frequencies are more or less
the same. This means that the outcomes that were more frequent in the past will still be
more frequent, and vice versa.

Of course, the frequencies with which some outcome occurs in two long sequences
of $N$ events are not precisely equal. But it is usually assumed that the larger $N$ is, the
smaller is the difference between them. In other words, when $N$ tends to $\infty$, the frequencies
tend to a number that is called a limit frequency, or a probability $p_i$, of an outcome $i$.
The observed frequencies serve as estimates for these
probabilities: the bigger the sample we take, the better these estimates are.

These frequencies are the additional information that Shannon (1948) used to diminish
the number of necessary questions.

Why probabilities help to diminish the number of questions: explanation in
simple terms. If we have just one event, then, probabilities or no probabilities,
we still have to ask all $Q = \lceil \log_2 n \rceil$ questions. However, if we have $N$ similar events,
and we are interested in knowing the outcomes of all of them, we do not have to ask $Q$
questions all $N$ times: we can sometimes get away with fewer than $QN$ questions and still
know all the outcomes.

Let's give a simple example of why this is possible. Suppose that there are two possible
outcomes, and that the probability of the second one is very small. Let's consider the case
of 10 events. If we knew no probabilities,
there would be $2^{10} = 1024$ possible combinations of outcomes, and so we would need to ask at
least $10 = \log_2 1024$ questions in order to find all the outcomes.

But we do know the probabilities. And due to the fact that the probability of the
second outcome is very small, it is highly improbable that there will be 2 or more cases out
of 10 with the second outcome. If we neglect these improbable cases, we conclude that
there are not 1024, but only 11 possible combinations: second outcome in the first event,
first in all the others; second outcome in the second event, first in all the others; etc.
(10 such combinations); and the eleventh, which corresponds to the first outcome in all the
events. To find a combination out of 11 possible ones we need only $\lceil \log_2 11 \rceil = 4$
questions. On average we have 4/10 questions per event.

So, if we neglect low probability combinations of outcomes, then we can drastically
decrease the number of questions. But what if we do not neglect them? Let us show
that the average number of binary questions can still be kept small. Indeed, in the above
example, we can consider 12 mutually exclusive classes of outcomes: the 11 combinations
listed above, and a twelfth class that contains all the other combinations. To find out which
of the 12 classes occurred we need $\lceil \log_2 12 \rceil = 4$ questions. If it is one of the first 11
classes, we already know all the outcomes. If it is the twelfth
class, we still have to ask 10 additional questions to find out the actual outcomes of all
events. In this case we need 10 additional questions, but this case is very rare (probability
$< 0.01$). Therefore, it adds $< 0.01 \cdot 10 = 0.1$ to the average number of questions. So, we
can handle rare cases with a small effect on the average number of questions.
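
The counting in this example can be checked by brute force. The sketch below is an illustration only: the helper names are hypothetical, outcome labels 1 and 2 are an assumed encoding, and the twelfth class is assumed to collect every combination with two or more occurrences of the rare outcome. The value 0.01 is the probability bound quoted in the text, not a computed quantity.

```python
from itertools import product


def count_combinations(num_events: int, max_rare: int) -> int:
    """Count outcome sequences with at most max_rare occurrences of outcome 2."""
    return sum(
        1
        for seq in product((1, 2), repeat=num_events)
        if seq.count(2) <= max_rare
    )


def ceil_log2(x: int) -> int:
    """Exact ceiling of log2(x) for a positive integer x."""
    return (x - 1).bit_length()


num_events = 10
all_combinations = count_combinations(num_events, num_events)  # nothing neglected
likely_combinations = count_combinations(num_events, 1)  # at most one rare outcome

questions_if_neglected = ceil_log2(likely_combinations)
# Twelve classes: the likely combinations plus one class "everything else".
questions_for_class = ceil_log2(likely_combinations + 1)
# The rare class costs 10 extra questions; 0.01 is the bound quoted in the text.
average_upper_bound = questions_for_class + 0.01 * num_events
```

Dividing `average_upper_bound` by `num_events` gives the corresponding bound on the number of questions per event.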

The above-given example may look purely mathematical, but it has lots of real-world
applications. As an example, let us take technical diagnosis: a system doesn't work, and
we must find out which of $n$ components failed. Here we have two outcomes: good and
failed. In case the reliability of these components is sufficiently high, so that $p_2 \ll 1$, we
can neglect the possibility of multiple failures, and thus simplify the problem.

Some statistics. When talking about Shannon’s theory one cannot avoid using statistics. 
However, we’ll not copy (Shannon, 1948): instead we reformulate so that it would be easy 
to obtain a Dempster-Shafer modification. 

Suppose that we know the probabilities $p_i$, and that we are interested in the outcomes
of $N$ events, where $N$ is given. Let's fix $i$ and estimate the number of events $N_i$ in which
the outcome is $i$.


This number $N_i$ is obtained by adding all the events in which the outcome was $i$,
so $N_i = n_1 + n_2 + \ldots + n_N$, where $n_k$ equals 1 if in the $k$-th event the outcome is $i$, and 0
otherwise. The average $E(n_k)$ of $n_k$ equals $p_i \cdot 1 + (1 - p_i) \cdot 0 = p_i$. The mean square
deviation $\sigma[n_k]$ is determined by the formula
$\sigma^2[n_k] = p_i (1 - E(n_k))^2 + (1 - p_i)(0 - E(n_k))^2$.
If we substitute here $E(n_k) = p_i$, we get $\sigma^2[n_k] = p_i (1 - p_i)$. The outcomes of all these
events are considered independent, therefore $n_k$ are independent random variables. Hence
the average value of $N_i$ equals the sum of the averages of $n_k$:
$E[N_i] = E[n_1] + E[n_2] + \ldots + E[n_N] = N p_i$. The mean square deviation $\sigma[N_i]$ satisfies a
corresponding equation
$\sigma^2[N_i] = \sigma^2[n_1] + \sigma^2[n_2] + \ldots + \sigma^2[n_N] = N p_i (1 - p_i)$, so $\sigma[N_i] = \sqrt{p_i (1 - p_i) N}$.

For big $N$ the sum of identically distributed independent random variables tends to
a Gaussian distribution (the well-known central limit theorem), therefore for big $N$ we
can assume that $N_i$ is a random variable with a Gaussian distribution. Theoretically a
random Gaussian variable with the average $a$ and a standard deviation $\sigma$ can take any
value. However, in practice, if, e.g., one buys a measuring instrument with guaranteed
0.1V standard deviation, and it gives an error of 1V, it means that something is wrong with
this instrument. Therefore it is assumed that only some values are practically possible.
Usually a "$k$-sigma" rule is accepted: the real value can only take values from $a - k\sigma$
to $a + k\sigma$, where $k$ is 2, 3 or 4. So in our case we can conclude that $N_i$ lies between
$N p_i - k \sqrt{p_i (1 - p_i) N}$ and $N p_i + k \sqrt{p_i (1 - p_i) N}$. Now we are ready for the formulation
of Shannon's result.

Comment. In this quality control example the choice of k matters, but, as we’ll see, in our 
case the results do not depend on k at all. 
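
The "k-sigma" bounds for the number of events with a given outcome are easy to evaluate numerically. The following Python sketch is only an illustration; the function names and the sample values in the usage note are assumptions, not part of the original text.

```python
import math


def k_sigma_interval(p: float, num_events: int, k: float) -> tuple:
    """Bounds N*p -/+ k*sqrt(p*(1-p)*N) for the count of one outcome."""
    mean = num_events * p
    sigma = math.sqrt(p * (1.0 - p) * num_events)
    return mean - k * sigma, mean + k * sigma


def is_practically_possible(count: int, p: float, num_events: int, k: float) -> bool:
    """True if the observed count lies inside the k-sigma interval."""
    low, high = k_sigma_interval(p, num_events, k)
    return low <= count <= high
```

For example, with `p = 0.5`, `num_events = 100` and `k = 4`, the standard deviation is 5, so `k_sigma_interval` returns the bounds 30 and 70, and a count of 100 is rejected by `is_practically_possible`.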

Formulation of Shannon’s results. 

Definitions. Suppose that a real number $k > 0$ and a positive integer $n$ are given; $n$
is called the number of outcomes. By a probabilistic knowledge we mean a set $\{p_i\}$ of $n$
real numbers, $p_i \ge 0$, $\sum_i p_i = 1$. $p_i$ is called a probability of the $i$-th outcome.

Suppose that an integer $N$ is given; it is called the number of events. By a result of
$N$ events we mean a sequence $r_k$, $1 \le k \le N$, of integers from 1 to $n$. $r_k$ is called the result
of the $k$-th event. The number of events that resulted in the $i$-th outcome will be denoted by $N_i$.
We say that the result of $N$ events is consistent with the probabilistic knowledge $\{p_i\}$ if for
every $i$ the following inequality is true:
$N p_i - k \sqrt{p_i (1 - p_i) N} \le N_i \le N p_i + k \sqrt{p_i (1 - p_i) N}$.

Let's denote the number of all consistent results by $N_{cons}(N)$. The number
$\lceil \log_2 (N_{cons}(N)) \rceil$ will be called the number of questions necessary to determine the results
of $N$ events and denoted by $Q(N)$. The fraction $Q(N)/N$ will be called the average number
of questions. The limit of the average number of questions will be called the complexity
of knowledge acquisition.

THEOREM (Shannon). When the number of events $N$ tends to infinity, the average
number of questions tends to $-\sum_i p_i \log_2 (p_i)$.

Comments. 1. This sum is known as the entropy of a probabilistic distribution $\{p_i\}$ and is
denoted by $S$ or $S(\{p_i\})$. So Shannon's theorem says that if we know the probabilities
of all the outcomes, then the average number of questions that we have to ask in order
to get a complete knowledge equals the entropy of this probabilistic distribution. In
other words: in case we know all the probabilities, the complexity of knowledge acquisition
equals the entropy of this probabilistic distribution.

2. As promised, the result does not depend on k. 

3. Since we modified Shannon’s definitions, we cannot use the original proof. Our 
proof is given in Section 6. 
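
As a numerical illustration of the theorem (not a substitute for the proof), the sketch below counts the consistent results exactly for the special case of two outcomes and compares $Q(N)/N$ with the entropy. The function names are hypothetical, and the restriction to $n = 2$ is an assumption made only to keep the count a simple sum of binomial coefficients.

```python
import math


def entropy(probabilities) -> float:
    """Shannon entropy -sum p*log2(p), with 0*log2(0) taken as 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def average_questions_two_outcomes(p1: float, num_events: int, k: float) -> float:
    """Q(N)/N for n = 2 outcomes, counting consistent results exactly.

    With two outcomes, N_2 = N - N_1 and p_2 = 1 - p_1, so both consistency
    inequalities reduce to the same k-sigma condition on N_1.
    """
    half_width = k * math.sqrt(p1 * (1.0 - p1) * num_events)
    low = max(0, math.ceil(num_events * p1 - half_width))
    high = min(num_events, math.floor(num_events * p1 + half_width))
    # Each admissible value j of N_1 corresponds to C(N, j) sequences.
    n_cons = sum(math.comb(num_events, j) for j in range(low, high + 1))
    if n_cons == 0:
        raise ValueError("no result is consistent for these parameters")
    q = (n_cons - 1).bit_length()  # exact ceiling of log2(n_cons)
    return q / num_events
```

Comparing `average_questions_two_outcomes(0.1, N, 2)` with `entropy([0.1, 0.9])` for growing `N` shows the ratio approaching the entropy, although slowly, because the k-sigma window contributes a correction that shrinks only like $1/\sqrt{N}$.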

3. DEMPSTER-SHAFER CASE

Dempster-Shafer (DS) formalism in brief (Smets et al., 1988). The basic element
of knowledge in this formalism is as follows: an expert gives several hypotheses $E_1, \ldots, E_p$
about the real world (these hypotheses are not necessarily incompatible) and describes
his degrees of belief $m(E_1), m(E_2), \ldots, m(E_p)$ in each of these hypotheses. These values are
called masses, and their sum is supposed to be equal to 1. There are also combination rules
that allow us to combine the knowledge of several experts; as a result we again get a set
of hypotheses (that combine the hypotheses of several experts), and their masses (degrees
of belief).

So in general the knowledge consists of a finite set of statements $E_1, E_2, \ldots, E_p$ about
the real world, and a set of real numbers $m(E_i)$ such that $\sum_i m(E_i) = 1$.

What "complete knowledge" means in DS. This knowledge is incomplete: first of all,
because we do not know which of the hypotheses $E_i$ is true. But even if we manage to figure
that out, the uncertainty can still remain, because this hypothesis $E_i$ does not necessarily
determine uniquely the state of our system. Therefore, if we want to estimate how far we
are from the complete knowledge, we must know what is meant by a complete knowledge.
In other words, we need to know the set $W$ of possible states of the analyzed system
(these states are sometimes called possible worlds). Of course, there are infinitely many
states of any real object, but usually we are interested only in finitely many properties
$P_1, P_2, \ldots, P_m$. It means that if for some pair of states $s_1, s_2$ each of these properties is true
in $s_1$ if and only if it is true in $s_2$, then we consider them as one state. In this sense a
state is uniquely determined by the $m$-dimensional Boolean vector that consists of the truth
values $P_i(s)$. So the set of all possible worlds consists of all such vectors for which a state
$s$ with these properties is possible at all.

Where do we take the masses from? In order to use this formalism to describe actual
knowledge we must somehow assign the masses to the experts' beliefs. The fact that the
sum of these masses equals 1 prompts the interpretation of masses as probabilities. And,
indeed, the very formalism stemmed from probabilities, therefore the probabilistic way is one
of the possible ways to estimate masses.

For example, we can ask several experts what statement better describes their knowledge,
take all these statements for $E_i$, and for $m(E_i)$ take the fraction $N(E_i)/N$, where $N$
is the total number of experts, and $N(E_i)$ is the number of experts whose knowledge is
described by the statement $E_i$. Or, alternatively, we can ask one expert, and by analyzing
the similar situations he can say that in the part $m(E_i)$ of all these cases a statement $E_i$
was true. It is also possible that the expert does not know so many cases, but he can
make a guess, based on his experience of similar cases.

We'll consider only probabilistic methods to determine masses; why? There exist other
methods to determine masses that are not of probabilistic origin, but we'll consider only
probabilistic ones for 3 reasons (more detailed explanations of the pro-probabilistic
viewpoint can be found in Pearl, 1989; Dubois and Prade, 1989; Halpern and Fagin, 1990;
Shafer and Pearl, 1990):

1) There are arguments (starting from Savage, 1954, 1962) that if an expert assigns
the degrees of belief to several mutually exclusive events, and assigns them in a rational
manner, then they automatically satisfy all the properties of probabilities (they are called
subjective probabilities). In the Dempster-Shafer case, the mass $m(E)$ represents an expert's
degree of belief in the statement "the set of all possible alternatives coincides with $E$".
Such statements for different $E$ are mutually exclusive, and therefore, we can apply the
above-mentioned arguments.

2) Several non-probabilistic methods of assigning degrees of belief that were successfully
applied turned out to have a probabilistic origin; for example, for the rules of MYCIN, the
famous successful expert system (Shortliffe, 1976; Buchanan and Shortliffe, 1984), it was
proved in (Heckerman, 1986).

3) Finally, in case we interpret masses as probabilities, we know precisely what we
mean by saying that we believe in $E_i$ with the degree of belief $m(E_i)$: namely, as we'll show
right now, this knowledge can be easily reformulated in terms of the future behavior of the
system. Therefore we can understand in precise terms what is meant by this knowledge,
and what knowledge we need in addition so that we would be able to narrow our
predictions to one actual outcome and thus get a complete knowledge. In case we do
not use a probabilistic interpretation, it is difficult to figure out what restrictions this
knowledge imposes on future outcomes.

What does a DS knowledge mean? In case we accept a probabilistic interpretation,
the knowledge that the hypothesis $E_i$ is true with mass $m(E_i)$ can be interpreted as
follows: if we have $N$ similar events, then among these $N$ cases there are approximately
$N m(E_1)$ in which the outcomes satisfy the statement $E_1$; among the remaining ones there
are approximately $N m(E_2)$ cases in which $E_2$ is true, etc.


Warning. This does not mean that $E_i$ is true only in $N m(E_i)$ cases. According
to the original interpretation of Dempster and Shafer, the relation between masses and
probabilities is more complicated. In this interpretation, when our knowledge is given in
a DS form, it means that we do not know all the probabilities $p_j$. Instead, we know a
class $P$ of probability distributions that contains the actual distribution $p$. For each event
$E$, different distributions $p$ from this class lead to different values of $p(E)$. These values
form an interval $[p^-, p^+]$. The smallest possible value $p^-$ (it is also called a lower probability)
is equal to our belief $bel(E)$ in $E$, and the biggest possible value $p^+$ coincides with the
plausibility $pl(E)$ of the event $E$.

To illustrate this point, let us give an example when masses are different from proba- 
bilities. 

Example. Suppose that the whole knowledge of an expert is that to some extent he
believes in some statement $E$. If we denote the corresponding degree of belief by $m$, we can
express this knowledge in DS terms as follows: he believes in $E_1 = E$ with degree of belief
$m(E_1) = m$, and with the remaining degree of belief $m(E_2) = 1 - m$ he knows nothing,
i.e., $E_2$ is a statement that is always true. In our terms this knowledge means that out of
$N$ events there are $\approx Nm$ in which $E$ is true, and $\approx N(1 - m)$ in which $E_2$ is true. But
$E_2$ is always true, so the only conclusion is that in at least $\approx Nm$ events $E$ is true. It is
possible that $E$ is always true (if it is also true for the remaining $N(1 - m)$ events), and
it is also possible that $E$ is true only in $Nm$ cases (if $E$ is false for the outcomes of the
remaining events).
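
The interval between the smallest and the largest possible frequency of $E$ in this example can be reproduced with the standard Dempster-Shafer formulas for belief and plausibility, which this section mentions but does not spell out: belief sums the masses of statements contained in the event, and plausibility sums the masses of statements that intersect it. The sketch below uses hypothetical numbers (three possible worlds, $E = \{1\}$, $m = 0.25$) and hypothetical function names.

```python
def belief(masses: dict, event: frozenset) -> float:
    """bel(event): total mass of the statements that are contained in the event."""
    return sum(m for statement, m in masses.items() if statement <= event)


def plausibility(masses: dict, event: frozenset) -> float:
    """pl(event): total mass of the statements that intersect the event."""
    return sum(m for statement, m in masses.items() if statement & event)


# Hypothetical numbers: three possible worlds, E = {1}, degree of belief m = 0.25.
worlds = frozenset({1, 2, 3})
E = frozenset({1})
masses = {E: 0.25, worlds: 0.75}

lower, upper = belief(masses, E), plausibility(masses, E)
```

Here `lower` equals the mass $m$ of $E$ and `upper` equals 1, matching the conclusion that $E$ is true in at least about $Nm$ events and possibly in all of them.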

We are almost ready to formalize this idea; the only problem is how to formalize
"approximately". But since we interpret masses as probabilities, we can apply the same
statistical estimates as in the previous section. So we arrive at the following definitions.

Definitions and the main result. 

Denotations. For any finite set $X$, we'll denote by $|X|$ the number of its elements.

Definitions. Suppose that a real number k > 0 is given. Suppose also that a finite 
set W is given. Its elements will be called outcomes, or possible worlds. 

Comment. In the following text we'll suppose that the possible worlds are ordered, so
that instead of talking about a world we can talk about its number $i$, $1 \le i \le n = |W|$. In
these terms $W$ is equal to the set $\{1, 2, \ldots, n\}$.



By a Dempster-Shafer knowledge, or DS knowledge for short, we mean a finite set of
pairs $\langle E_i, m_i \rangle$, $1 \le i \le p$, where $E_i$ are subsets of $W$ (called statements) and $m_i$ are
real numbers (called masses or degrees of belief) such that $m_i \ge 0$ and $\sum_i m_i = 1$.

If an outcome $r$ belongs to the set $E_i$, we'll say that $r$ satisfies $E_i$. Suppose that an
integer $N$ is given; it is called the number of events. By a result of $N$ events we mean a
sequence $r_k$, $1 \le k \le N$, of integers from 1 to $n$. $r_k$ is called the outcome of the $k$-th event.
We say that the result of $N$ events is consistent with the DS knowledge $\langle E_i, m_i \rangle$ if the
set $\{1, 2, \ldots, N\}$ can be divided into $p$ subsets $H_1, \ldots, H_p$ with no common elements in
such a way that:

1) if $k$ belongs to $H_i$, then the outcome $r_k$ of the $k$-th event satisfies $E_i$;

2) the number $|H_i|$ of elements in $H_i$ satisfies the inequality
$N m_i - k \sqrt{m_i (1 - m_i) N} \le |H_i| \le N m_i + k \sqrt{m_i (1 - m_i) N}$.

Let's denote the number of all results that are consistent with a given DS knowledge
by $N_{cons}(N)$. The number $\lceil \log_2 (N_{cons}(N)) \rceil$ will be called the number of questions
necessary to determine the results of $N$ events and denoted by $Q(N)$. The fraction $Q(N)/N$
will be called the average number of questions. The limit of the average number of questions,
when $N \to \infty$, will be called the complexity of knowledge acquisition.

To formulate our estimate we need some additional definitions. 

Definitions. By a probabilistic distribution we mean an array of $n$ non-negative
numbers $p_1, \ldots, p_n$ such that $\sum_j p_j = 1$. We say that a probabilistic distribution is consistent
with the DS knowledge $\langle E_i, m_i \rangle$, $i = 1, \ldots, p$, if and only if there exist non-negative
numbers $z_{ij}$ such that $\sum_i z_{ij} = p_j$, $\sum_j z_{ij} = m_i$, and $z_{ij} = 0$ if $j$ does not belong to $E_i$.

Comments. 1. Informally, we want to divide the whole fraction $m_i$ of events, about
which the expert predicted that $E_i$ is true, into groups with fractions $z_{ij}$ for all $j \in E_i$,
so that the outcome in the group $z_{ij}$ is $j$.

2. This definition is not explicitly constructive, but if we fix a probabilistic distribution
and a DS knowledge, the question whether they are consistent or not is a linear
programming problem, so we can use the known algorithms to solve it (the simplex method
or the algorithm of Karmarkar (1984)).

By an entropy of a DS knowledge we mean the maximum entropy of all probabilistic
distributions that are consistent with it.
In other words, this entropy is a solution to the following mathematical problem:

$-\sum_j p_j \log_2 p_j \to \max$ under the conditions that $\sum_i z_{ij} = p_j$, $\sum_j z_{ij} = m_i$, $z_{ij} \ge 0$,
and $z_{ij} = 0$ for $j$ not in $E_i$, where $i$ runs from 1 to $p$, and $j$ from 1 to $n$.

If we substitute $p_j = \sum_i z_{ij}$, we can reformulate it without using $p_j$. Entropy is a
solution of the following mathematical optimization problem:

$-\sum_j \left( \sum_i z_{ij} \right) \log_2 \left( \sum_i z_{ij} \right) \to \max$

under the conditions that $\sum_j z_{ij} = m_i$, $z_{ij} \ge 0$, and $z_{ij} = 0$ for $j$ not in $E_i$.

Comments. 1. Entropy is a smooth concave function (its negative is convex), and all the
restrictions are linear in $z_{ij}$, so in order to compute the entropy of a given DS knowledge we
must maximize a smooth concave function on a convex domain. In numerical mathematics
there exist sufficiently efficient methods for doing that.

2. For the degenerate case, when a DS knowledge is a probabilistic one, i.e., when
$n = p$ and $E_i = \{i\}$, there is precisely one probabilistic distribution that is consistent with
this DS knowledge: this very $p_j = m_j$, and therefore the entropy of a DS knowledge in this case
coincides with Shannon's entropy.
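
To make the optimization problem concrete, here is a minimal Python sketch that approximates the entropy of a DS knowledge. It uses a Frank-Wolfe (conditional gradient) iteration, which is a technique chosen for this illustration and not a method named in the paper; the function names, the step-size rule, and the default iteration count are likewise assumptions.

```python
import math


def entropy(probabilities) -> float:
    """Shannon entropy in bits, with 0*log2(0) taken as 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def ds_entropy(statements, masses, n, iterations=2000):
    """Approximate the maximum entropy over distributions consistent with a DS knowledge.

    statements[i] is a non-empty collection of outcomes from 1..n (the set E_i),
    masses[i] is m_i. The unknowns z[i][j] split the mass m_i among j in E_i.
    """
    sets = [sorted(set(s)) for s in statements]
    # Interior starting point: spread each mass uniformly over its statement.
    z = [{j: m / len(s) for j in s} for s, m in zip(sets, masses)]

    def distribution():
        p = [0.0] * (n + 1)  # index 0 unused; outcomes are numbered from 1
        for row in z:
            for j, value in row.items():
                p[j] += value
        return p

    for t in range(iterations):
        p = distribution()
        step = 2.0 / (t + 3)  # step < 1 keeps every reachable p_j positive
        for row, s, m in zip(z, sets, masses):
            # Entropy grows fastest by moving mass to the least probable outcome.
            best = min(s, key=lambda j: p[j])
            for j in s:
                target = m if j == best else 0.0
                row[j] += step * (target - row[j])
    return entropy(distribution()[1:])
```

In the degenerate case of singleton statements the iteration leaves the distribution unchanged, so `ds_entropy([{1}, {2}, {3}], [0.5, 0.25, 0.25], 3)` returns the Shannon entropy 1.5 of that distribution. For a single statement covering two outcomes, `ds_entropy([{1, 2}], [1.0], 2)` approaches 1.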

MAIN THEOREM. The complexity of knowledge acquisition for a DS knowledge
$\langle E_i, m_i \rangle$ is equal to the entropy of this knowledge.

Comments. 1. Our definition of entropy is thus a natural generalization of Shannon's
entropy to a DS case. This does not mean, of course, that this is the only generalization. The
notion of entropy is used not only to compute the average number of questions, but in
several other applications: in communication theory, in pattern recognition, etc. Several
different generalizations of entropy to the DS formalism have been proposed and turned out to
be efficient in these other problems (see, e.g., Yager, 1983; Pal and Dutta Majumder, 1986;
Dubois and Prade, 1987; Nguyen, 1987; Klir and Folger, 1988; Dubois and Prade, 1989;
Pal, 1991; Kosko, 1992).

2. That the complexity of knowledge acquisition must be greater than or equal to the
entropy of a DS knowledge is rather easy to prove. Indeed, if a probabilistic distribution
$p_j$ is consistent with a DS knowledge, and a result of $N$ events is consistent with this
distribution, then it is consistent with the DS knowledge as well. Therefore there are at
least as many results consistent with the DS knowledge as there are results consistent with
$p_j$. Therefore the average number of questions in a DS case must be not smaller than the
average number of questions (entropy) for every probabilistic distribution that is consistent
with this knowledge. So it must be greater than or equal to the maximum of all such
probabilistic entropies; and we have called this maximum an entropy of a DS knowledge.
The fact that it is precisely equal, and not greater, is more difficult to prove, and demands
combinatorics (see Section 6).

4. THE ABOVE COMPLEXITY CHARACTERISTIC IS NOT SUFFI- 
CIENT, BUT WE NEED NOT WORRY ABOUT THAT 

Example. The above characteristic describes the average number of questions that we
need to ask in order to attain the complete knowledge. However, we'll now show that it
is sometimes possible that we add new information, and this characteristic remains
the same. The simplest of such situations is as follows: suppose that there are only two
possible outcomes. If we know nothing about them, this can be expressed in DS terms
as follows: there is only one statement ($p = 1$), and this statement $E_1$ is identically true
(i.e., $E_1 = W = \{1, 2\}$). In this case the above mathematical optimization problem is easy
to solve, and yields 1. This result is intuitively very reasonable: if we know nothing, and
there are two alternatives, we have to ask one binary question in order to figure out which
of the outcomes actually occurred.

Suppose now that we analyzed the previous cases and came to the conclusion that on
average in half of these cases the first outcome occurred, and in half of them the second one.
In other words, we add the new information that the probability of both outcomes is equal
to 1/2. This is really new information, because it diminishes the number of possibilities.
For example, if we observed 100 events, in case we knew nothing it was quite possible that
in all the cases we would observe the first outcome. In case we know that the probability is
1/2, the possible number $N_1$ of cases in which the first outcome occurs is restricted
by the inequalities
$1/2 \cdot 100 - k \sqrt{1/2 (1 - 1/2) 100} \le N_1 \le 1/2 \cdot 100 + k \sqrt{1/2 (1 - 1/2) 100}$,
or $50 - 5k \le N_1 \le 50 + 5k$. Even for $k = 4$ the value $N_1 = 100$ does not satisfy this
inequality and is therefore negligibly rare (therefore for $k < 4$ it also cannot be equal to
100).

In other words, we added new information. But if we compute the uncertainty
(entropy) of the resulting probabilistic distribution, we get
$-1/2 \log_2 (1/2) - 1/2 \log_2 (1/2) = -\log_2 (1/2) = 1$, i.e., again 1! We added the new
information, but the uncertainty did not diminish. We still have to ask on average one
question in order to get a complete knowledge.



Isn't it a paradox? No, because we were estimating the average number of questions
$\lim Q(N)/N$. We have two cases, in which the necessary number of questions $Q_1(N)$ in the
first case is evidently bigger than in the second one ($Q_1(N) > Q_2(N)$), but this difference
disappears in the limit. In order to show that it is really so, let us compute $Q(N)$ in both
cases.

If we know nothing, then all sequences of 1 and 2 are possible as the results, i.e., in
this case $N_{cons}$ is equal to $2^N$. Therefore $\log_2 N_{cons} = N$, and $Q_1(N) = \lceil \log_2 N_{cons} \rceil = N$.

In the second case computations are more complicated (so we moved them to Section
6), and the result for big $N$ is $Q_2(N) = N - c$, where $c$ is a constant depending on $k$. Since
$c/N \to 0$, in the limit this difference disappears, and so it looks like in these two cases the
uncertainty is the same.
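
The size of this difference can be explored numerically. The sketch below counts, for two equally probable outcomes, the results that pass the k-sigma test and compares the logarithm of that count with $N$. It compares the logarithms before rounding up to an integer, which is an assumption about how the constant $c$ is meant; the function names are hypothetical.

```python
import math


def log2_consistent_count(num_events: int, k: float) -> float:
    """log2 of the number of two-outcome results with N_1 in the k-sigma window, p = 1/2."""
    half_width = k * math.sqrt(num_events) / 2.0
    low = max(0, math.ceil(num_events / 2.0 - half_width))
    high = min(num_events, math.floor(num_events / 2.0 + half_width))
    n_cons = sum(math.comb(num_events, j) for j in range(low, high + 1))
    return math.log2(n_cons)


def question_gap(num_events: int, k: float) -> float:
    """N (the log2 count when nothing is known) minus the log2 count for p = 1/2."""
    return num_events - log2_consistent_count(num_events, k)
```

With `k = 2`, `question_gap(4, 2)` is zero because every value of $N_1$ from 0 to 4 passes the test, while `question_gap(5, 2)` is a small positive number below 0.1, since only the two extreme results are excluded.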

Do we need to worry about that? To answer this question, let’s give a numeric
estimate of the difference between $Q_1(N)$ and $Q_2(N)$; this difference occurs only when
the inequality $N/2 - k\sqrt{N}/2 \le N_1 \le N/2 + k\sqrt{N}/2$ really restricts the possible values of $N_1$.
If $k = 2$, then for $N \le 4$ all possible values of $N_1$ from 0 to $N$ satisfy it, so $Q_1 = Q_2$.
Therefore the difference starts only with $N = 5$. The bigger $k$, the bigger the $N$ from
which the difference appears. The value of this difference $c = Q_1(N) - Q_2(N)$ depends on
$k$ (see the proof in Section 6). The smaller the $k$, the bigger $c$ is. The smallest value of $k$
that is used in statistics is $k = 2$. For $k = 2$, we have $c \approx 0.1$. Compared with $Q_1 = 5$ (the
value at $N = 5$), this is 2%. For bigger $N$ or bigger $k$ it is even smaller.
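
The size of this effect can be checked directly for the two-outcome case. The following is only a sketch: the helper names `n_cons`, `c_exact` and `c_gauss` are ours, we assume the inequality is non-strict, and we assume an integer $k$ so that the test can be done in exact integer arithmetic. It counts the sequences whose number of 1’s satisfies the inequality, and compares the un-rounded difference $N - \log_2 N_{\mathrm{cons}}$ with the large-$N$ estimate $-\log_2 P$ from Section 6.

```python
import math

def n_cons(N, k):
    """Count sequences of length N over {1, 2} whose number of 1's, N1,
    satisfies N/2 - k*sqrt(N)/2 <= N1 <= N/2 + k*sqrt(N)/2.
    The condition is rewritten as (2*N1 - N)**2 <= k*k*N, which is exact
    for integer k (an assumption of this sketch)."""
    return sum(math.comb(N, n1) for n1 in range(N + 1)
               if (2 * n1 - N) ** 2 <= k * k * N)

def c_exact(N, k):
    """Un-rounded difference N - log2(N_cons) for a finite N."""
    return N - math.log2(n_cons(N, k))

def c_gauss(k):
    """Large-N estimate c = -log2(P), where P = erf(k / sqrt(2)) is the
    probability that a Gaussian variable lies within k standard deviations."""
    return -math.log2(math.erf(k / math.sqrt(2)))

if __name__ == "__main__":
    for N in (4, 5, 100, 1000):
        print(N, n_cons(N, 2), round(c_exact(N, 2), 4))
    print("Gaussian estimate for k = 2:", round(c_gauss(2), 4))
```

For $k = 2$ and $N = 4$ all 16 sequences are counted, and for $N = 5$ only the two constant sequences are excluded (30 out of 32), in agreement with the statement that the difference starts with $N = 5$.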

So this difference makes practical sense only if we can somehow estimate $Q(N)$ with a
similar (or better) precision. But $Q(N)$ is computed from the initial degrees of belief
(masses) $m_i$. There is already only a tiny difference between, say, a 70% and an 80% degree of
belief, and hardly anyone can claim that in some cases he is 72% sure, and in some other
cases 73%, and that he feels the difference. There are certainly not so many distinguishable
subjective degrees of belief. In view of that, the degrees of belief are defined initially with at best
5-10% precision. Therefore the values of $Q(N)$ are known with that precision only, and
in comparison with that, a correction $c$ of at most 2% is, so to say, under the noise level.

So the answer to the question in the title is: no, we don’t need to worry. 

5. PROBABILISTIC KNOWLEDGE 

Let’s analyze the case of probabilistic knowledge as described in (Nilsson, 1986),
when we know the probabilities of several statements. In this case, we can repeat the
above-given definitions almost verbatim.

Definitions. Suppose that a real number $k > 0$ is given. Suppose also that a finite
set $W = \{1, 2, \ldots, n\}$ is given. Its elements will be called outcomes, or possible worlds. By
a probabilistic knowledge we mean a finite set of pairs $\langle E_i, p(E_i) \rangle$, $1 \le i \le p$, where $E_i$
are subsets of $W$ and $0 \le p(E_i) \le 1$. The subsets $E_i$ are called statements, and the number
$p(E_i)$ is called the probability of the $i$-th statement.

If an outcome $r$ belongs to the set $E_i$, we’ll say that $r$ satisfies $E_i$.

Suppose that an integer $N$ is given; it is called the number of events. By a result of $N$
events we mean a sequence $r_k$, $1 \le k \le N$, of integers from 1 to $n$; $r_k$ is called the outcome of
the $k$-th event. We say that the result of $N$ events is consistent with the probabilistic knowledge
$\langle E_i, p(E_i) \rangle$ if for all $i$ from 1 to $p$ the number $N_i$ of all $r_k$ that belong to $E_i$ satisfies
the inequality $N p(E_i) - k\sqrt{p(E_i)(1 - p(E_i))N} \le N_i \le N p(E_i) + k\sqrt{p(E_i)(1 - p(E_i))N}$.

Let’s denote the number of all results that are consistent with a given probabilistic
knowledge by $N_{\mathrm{cons}}(N)$. The number $\lceil \log_2(N_{\mathrm{cons}}(N)) \rceil$ will be called the number of
questions necessary to determine the results of $N$ events, and denoted by $Q(N)$. The
fraction $Q(N)/N$ will be called the average number of questions. The limit of the average
number of questions when $N \to \infty$ will be called the complexity of knowledge acquisition.
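
For very small $N$ these definitions can be evaluated by brute force. The sketch below is illustrative only: the function names and the representation of a statement as a pair (set of outcomes, probability) are our assumptions, and a small floating-point tolerance is added on the boundaries of the inequality. It enumerates all $n^N$ results, keeps the consistent ones, and returns $Q(N)$.

```python
import itertools
import math

def is_consistent(result, statements, k):
    """result: tuple of outcomes r_1..r_N.
    statements: list of pairs (E, p) with E a set of outcomes and p = p(E)."""
    N = len(result)
    for E, p in statements:
        n_i = sum(1 for r in result if r in E)   # number of events satisfying E
        half_width = k * math.sqrt(p * (1 - p) * N)
        # the 1e-9 tolerance guards against rounding error on the boundary
        if not (N * p - half_width - 1e-9 <= n_i <= N * p + half_width + 1e-9):
            return False
    return True

def count_consistent(n, statements, k, N):
    """N_cons(N): number of results over W = {1..n} consistent with the knowledge."""
    worlds = range(1, n + 1)
    return sum(1 for result in itertools.product(worlds, repeat=N)
               if is_consistent(result, statements, k))

def questions(n, statements, k, N):
    """Q(N) = ceil(log2(N_cons(N))); assumes at least one consistent result."""
    return math.ceil(math.log2(count_consistent(n, statements, k, N)))

if __name__ == "__main__":
    knowledge = [({1}, 0.5)]          # two outcomes, p({1}) = 1/2
    for N in (4, 5, 8):
        print(N, count_consistent(2, knowledge, 2, N), questions(2, knowledge, 2, N))
```

The enumeration is exponential in $N$, so this is useful only as a check of the definitions, not as a way to approach the limit.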

By a probabilistic distribution we mean an array of $n$ non-negative numbers $p_1, \ldots, p_n$
such that $\sum_{j=1}^{n} p_j = 1$. We say that a probabilistic distribution is consistent with a
probabilistic knowledge $\langle E_i, p(E_i) \rangle$, $i = 1, \ldots, p$, if and only if for every $i$: $\sum_{j \in E_i} p_j = p(E_i)$.

By the entropy of a probabilistic knowledge we mean the maximum entropy of all probabilistic
distributions that are consistent with it, i.e., the solution to the following mathematical
optimization problem: $-\sum_{j=1}^{n} p_j \log_2 p_j \to \max$ under the conditions $\sum_{j \in E_i} p_j = p(E_i)$, $p_j \ge 0$,
and $\sum_{j=1}^{n} p_j = 1$.

Comment. This is also a convex optimization problem. 

THEOREM. The complexity of knowledge acquisition for a probabilistic knowledge is 
equal to the entropy of this knowledge. 

Comments. 1. The Main Theorem and this result can be combined as follows: if our knowledge
is not sufficient to determine all the probabilities uniquely, so that several different
probabilistic distributions are compatible with it, then the uncertainty of this knowledge is equal
to the uncertainty of the distribution with the maximal entropy. It is worth mentioning
that the distribution with maximal entropy has many other good properties, and is therefore
often used as the most “reasonable” one when processing incomplete data in science
(for a survey see Jaynes, 1979, and references therein; see also Kosheleva and Kreinovich
(1979) and Cheeseman (1985)).

2. A similar maximum entropy result can be proved for the case when part of the
knowledge is given in a DS form, and part in a probabilistic form. In this case we can also
formulate what we mean by saying that probabilities are consistent with a given knowledge,
and prove that the complexity of knowledge acquisition is equal to the maximum
entropy of all probabilistic distributions that are consistent with a given knowledge.
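
To make the entropy of a probabilistic knowledge concrete, consider the simplest special case of a single statement $E$ with probability $p(E)$. In this case the maximum-entropy distribution spreads $p(E)$ uniformly over $E$ and $1 - p(E)$ uniformly over the rest of $W$. The sketch below (function names are ours; it covers only this one-statement case, with $E$ a proper non-empty subset, and is not a general solver for the optimization problem) computes this distribution and checks it against a crude grid search for $n = 3$, $E = \{1, 2\}$.

```python
import math

def entropy(ps):
    """Shannon entropy in bits; terms with p = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

def max_entropy_single_statement(n, E, pE):
    """Maximum-entropy distribution on W = {1..n} under the single condition
    sum_{j in E} p_j = pE. Assumes E is a proper, non-empty subset of W."""
    m = len(E)
    return [pE / m if j in E else (1 - pE) / (n - m) for j in range(1, n + 1)]

def grid_search_entropy(pE, steps=1000):
    """Brute-force check for n = 3, E = {1, 2}: p3 = 1 - pE is forced,
    so we scan p1 over [0, pE] and set p2 = pE - p1."""
    return max(entropy([pE * t / steps, pE - pE * t / steps, 1 - pE])
               for t in range(steps + 1))

if __name__ == "__main__":
    dist = max_entropy_single_statement(3, {1, 2}, 0.6)
    print(dist, entropy(dist), grid_search_entropy(0.6))
```

With several overlapping statements no such closed form is available in general, and the convex optimization problem from the definition has to be solved numerically.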


6. PROOFS 


Proof of Shannon’s Theorem. As we have mentioned in the main text, the Theorem
that we prove is not Shannon’s original one, but its modification: Shannon was interested
in data communication, and not in asking questions. So we must modify the proof. The
proof that we are using first appeared in (Kreinovich, 1989). Let’s first fix some values $N_i$
that are consistent with the given probabilistic distribution. Due to the inequalities that
express the consistency demand, the ratio $f_i = N_i/N$ tends to $p_i$ as $N \to \infty$. Let’s count
the total number $C$ of results for which, for every $i$, the number of events with outcome $i$
is equal to this $N_i$. If we know $C$, we will be able to compute $N_{\mathrm{cons}}$ by adding these $C$’s.

Actually we are interested not in $N_{\mathrm{cons}}$ itself, but in $Q(N) \approx \log_2 N_{\mathrm{cons}}$, and moreover,
in $\lim (Q(N)/N)$. So we’ll try to estimate not only $C$, but also $\log_2 C$ and $\lim ((\log_2 C)/N)$.


To estimate $C$ means to count the total number of sequences of length $N$ in which
there are $N_1$ elements equal to 1, $N_2$ elements equal to 2, etc. The total number $C_1$
of ways to choose $N_1$ elements out of $N$ is well known in combinatorics, and is equal to
$\binom{N}{N_1} = N!/(N_1!\,(N - N_1)!)$. When we have chosen these $N_1$ elements, we have the problem of
choosing, out of the remaining $N - N_1$ elements, the $N_2$ elements where the outcome is 2; so for every
choice of 1’s we have $C_2 = \binom{N - N_1}{N_2}$ possibilities to choose 2’s. Therefore, in order to get
the total number of possibilities to choose 1’s and 2’s, we must multiply $C_2$ by $C_1$. Adding
3’s, 4’s, ..., $n$’s, we finally get the following formula for $C$:

$$C = C_1 C_2 \cdots C_{n-1} = \frac{N!}{N_1!\,(N - N_1)!} \cdot \frac{(N - N_1)!}{N_2!\,(N - N_1 - N_2)!} \cdots = \frac{N!}{N_1!\,N_2! \cdots N_n!}.$$

To simplify computations, let’s use the well-known Stirling formula, according to which $k!$
is asymptotically equivalent to $(k/e)^k \sqrt{2\pi k}$. If we substitute these expressions into the
above formula for $C$, we conclude that

$$C \sim \frac{(N/e)^N \sqrt{2\pi N}}{(N_1/e)^{N_1}\sqrt{2\pi N_1}\,(N_2/e)^{N_2}\sqrt{2\pi N_2} \cdots (N_n/e)^{N_n}\sqrt{2\pi N_n}}.$$



Since $\sum_i N_i = N$, the terms $e^N$ and $e^{N_i}$ cancel each other.

To get further simplification, we substitute $N_i = N f_i$, and correspondingly rewrite $N_i^{N_i}$
as $(N f_i)^{N f_i} = N^{N f_i} f_i^{N f_i}$. The term $N^N$ in the numerator and the terms
$N^{N f_1} N^{N f_2} \cdots N^{N f_n} = N^{N f_1 + N f_2 + \ldots + N f_n} = N^N$ in the denominator cancel each other. The terms with $\sqrt{N}$ lead
to a term that depends on $N$ as $c N^{-(n-1)/2}$. Now we are ready to estimate $\log_2 C$. Since the
logarithm of a product is equal to the sum of the logarithms, and $\log a^b = b \log a$, we conclude
that $\log_2 C \approx -N f_1 \log_2 f_1 - N f_2 \log_2 f_2 - \ldots - N f_n \log_2 f_n - \frac{1}{2}(n - 1) \log_2 N - \mathrm{const}$.
When $N \to \infty$, we have $1/N \to 0$, $(\log_2 N)/N \to 0$ and $f_i \to p_i$, therefore $(\log_2 C)/N \to
-p_1 \log_2 p_1 - p_2 \log_2 p_2 - \ldots - p_n \log_2 p_n$, i.e., $(\log_2 C)/N$ tends to the entropy of the
probabilistic distribution.
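
This limit is easy to check numerically. The sketch below (helper names are ours; the distribution $(0.5, 0.3, 0.2)$ is an arbitrary illustration) computes $\log_2 C$ for the multinomial coefficient $C = N!/(N_1! \cdots N_n!)$ through the log-gamma function, so no huge integers are needed, and compares $(\log_2 C)/N$ with the entropy.

```python
import math

def log2_multinomial(counts):
    """log2 of N! / (N_1! ... N_n!), computed via lgamma (ln k! = lgamma(k + 1))."""
    N = sum(counts)
    ln_c = math.lgamma(N + 1) - sum(math.lgamma(c + 1) for c in counts)
    return ln_c / math.log(2)

def entropy(ps):
    """Shannon entropy in bits; terms with p = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

if __name__ == "__main__":
    p = (0.5, 0.3, 0.2)              # illustrative distribution, our choice
    for N in (10, 100, 1000, 10000, 100000):
        counts = [round(N * pi) for pi in p]   # N_i = N * f_i with f_i = p_i
        print(N, round(log2_multinomial(counts) / N, 4), round(entropy(p), 4))
```

The gap between the two printed columns is governed by the $\frac{1}{2}(n - 1)(\log_2 N)/N$ term of the estimate, so it shrinks slowly but steadily as $N$ grows.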

Comment. We used the notation $A \sim B$ for expressions $A$ and $B$, meaning that the
difference between $A$ and $B$ is negligible in the limit $N \to \infty$ (i.e., the resulting difference
in $(\log_2 C)/N$ tends to 0).

Now that we have found an asymptotic expression for $C$, let’s compute $N_{\mathrm{cons}}$ and
$Q(N)/N$. For a given probabilistic distribution $\{p_i\}$ and every $i$, the possible values of $N_i$
form an interval of length $L_i = 2k\sqrt{p_i(1 - p_i)}\sqrt{N}$. So there are no more than $L_i + 1$ possible
values of $N_i$. The maximum value of $p_i(1 - p_i)$ is attained when $p_i = 1/2$, therefore
$p_i(1 - p_i) \le 1/4$, and hence $L_i \le 2k\sqrt{N/4} = k\sqrt{N}$. For every $i$ from 1 to $n$ there are at
most $k\sqrt{N} + 1$ possible values of $N_i$, so the total number $N_{\mathrm{co}}$ of possible combinations of
$N_i$ does not exceed $(k\sqrt{N} + 1)^n$.

The total number $N_{\mathrm{cons}}$ of consistent results is the sum of $N_{\mathrm{co}}$ different values of
$C$ (that correspond to different combinations $N_1, N_2, \ldots, N_n$). Let’s denote the biggest of
these $C$ by $C_{\max}$. Since $N_{\mathrm{cons}}$ is the sum of $N_{\mathrm{co}}$ terms, and each of them is not greater
than the biggest of them $C_{\max}$, we conclude that $N_{\mathrm{cons}} \le N_{\mathrm{co}} C_{\max}$. On the other hand,
the sum $N_{\mathrm{cons}}$ is not smaller than each of its terms, i.e., $C_{\max} \le N_{\mathrm{cons}}$. Combining these two
inequalities, we conclude that $C_{\max} \le N_{\mathrm{cons}} \le N_{\mathrm{co}} C_{\max}$. Since $N_{\mathrm{co}} \le (k\sqrt{N} + 1)^n$, we
conclude that $C_{\max} \le N_{\mathrm{cons}} \le (k\sqrt{N} + 1)^n C_{\max}$. Turning to logarithms, we find that
$\log_2(C_{\max}) \le \log_2(N_{\mathrm{cons}}) \le \log_2(C_{\max}) + (n/2) \log_2 N + \mathrm{const}$. Dividing by $N$, passing to
the limit $N \to \infty$, and using the fact that $\lim_{N \to \infty} (\log_2 N)/N = 0$ and the already proved
fact that $\log_2(C_{\max})/N$ tends to the entropy $S$, we conclude that $\lim Q(N)/N = S$. Q.E.D.
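
The convergence stated in the theorem can be observed numerically in the two-outcome case. The following sketch rests on our own assumptions: the names are ours, $p = 0.3$ and $k = 2$ are arbitrary illustrative values, and the consistency inequality is taken as non-strict. It sums the binomial coefficients over all consistent values of $N_1$ and compares $(\log_2 N_{\mathrm{cons}})/N$ with the entropy of the distribution $(p, 1 - p)$.

```python
import math

def binary_entropy(p):
    """Entropy in bits of the distribution (p, 1 - p), for 0 < p < 1."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def log2_n_cons(N, p, k):
    """log2 of the number of sequences over {1, 2} whose count N1 of 1's
    satisfies N*p - k*sqrt(p*(1-p)*N) <= N1 <= N*p + k*sqrt(p*(1-p)*N)."""
    half_width = k * math.sqrt(p * (1 - p) * N)
    lo = max(0, math.ceil(N * p - half_width))
    hi = min(N, math.floor(N * p + half_width))
    # exact big-integer sum of binomial coefficients; math.log2 accepts large ints
    return math.log2(sum(math.comb(N, n1) for n1 in range(lo, hi + 1)))

if __name__ == "__main__":
    p, k = 0.3, 2                     # illustrative values, our choice
    for N in (200, 2000, 20000):
        print(N, round(log2_n_cons(N, p, k) / N, 4), round(binary_entropy(p), 4))
```

The printed ratio approaches the entropy as $N$ grows; the approach is slow because the admissible frequencies $N_1/N$ deviate from $p$ by an amount of order $k/\sqrt{N}$.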

Proof of the Main Theorem. Let’s denote by $h_i$ some integers that satisfy
the inequalities $N m_i - k\sqrt{m_i(1 - m_i)N} \le h_i \le N m_i + k\sqrt{m_i(1 - m_i)N}$ from Section 3.
Let’s denote the ratios $h_i/N$ by $g_i$. Due to these inequalities, when $N \to \infty$, $g_i \to m_i$.

Unlike the previous Theorem, even if we know $g_i$, i.e., know how many outcomes
belong to $E_i$ for every $i$, we still cannot uniquely determine the frequencies $f_j$ of different
outcomes. If there exists a result of $N$ events with given frequencies $g_i$ and $f_j$, then we
can further subdivide each set $H_i$ into subsets with $Z_{ij}$ elements that correspond to different outcomes
$j \in E_i$. In this case $\sum_j Z_{ij} = h_i$ and $\sum_i Z_{ij} = N f_j$, therefore the frequencies $t_{ij} = Z_{ij}/N$
satisfy the equalities $\sum_j t_{ij} = g_i$ and $\sum_i t_{ij} = f_j$. Vice versa, if there exist values $t_{ij}$ such
that these two equalities are satisfied, and $N t_{ij}$ is an integer for all $i, j$, then we can divide
$W$ into sets of size $h_i$, each of them into sets with $N t_{ij}$ elements, and thus find a result with
given $g_i$ and $f_j$. If such $t_{ij}$ exist, we’ll say that the frequencies $g_i$ and $f_j$ are consistent
(note an evident analogy between this concept and the definition of consistency between
a DS knowledge and a probabilistic distribution).

Let’s now prove that if the set of frequencies $\{f_j\}$ is consistent with the set $\{g_i\}$, and
we have a result in which there are $N f_1$ outcomes that are equal to 1, $N f_2$ outcomes that
are equal to 2, etc., then this result is consistent with the original DS knowledge. Indeed,
we can subdivide the set of all the outcomes that are equal to $j$ into subsets with $N t_{ij}$
elements for all $i$ such that $j \in E_i$. We’ll say that the elements that are among these
$N t_{ij}$ ones are labeled by $i$. In total there are $\sum_j N t_{ij} = N \sum_j t_{ij} = N g_i = h_i$ elements
that are labeled by $i$, and for all of them $E_i$ is true. Since $h_i$ was chosen so as to satisfy
the inequalities that are necessary for consistency, we conclude that this result is really
consistent with the DS knowledge.

The number $C$ of results with given frequencies $\{f_j\}$ has already been computed in
the proof of Shannon’s theorem: $\lim ((\log_2 C)/N) = -\sum_j f_j \log_2 f_j$.

The total number $N_{\mathrm{cons}}$ of the results that are consistent with a given DS knowledge
is the sum of $N_{\mathrm{co}}$ different values of $C$ that correspond to different $f_j$. For a given $N$
there are at most $N + 1$ different values of $N_1 = N f_1$ $(0, 1, \ldots, N)$, at most $N + 1$ different
values of $N_2$, etc., in total at most $(N + 1)^n$ different sets $\{f_j\}$. So, like in the proof of
Shannon’s theorem, we get an inequality $C_{\max} \le N_{\mathrm{cons}} \le (N + 1)^n C_{\max}$, from which we
conclude that $\lim Q(N)/N = \lim (\log_2 C_{\max})/N$.

When $N \to \infty$, the values $g_i$ tend to $m_i$, and therefore these frequencies $f_j$ tend
to the probabilities $p_j$ that are consistent with the DS knowledge. Therefore $(\log_2 C)/N$
tends to the entropy of the limit probabilistic distribution, and $(\log_2 C_{\max})/N$ tends to the
maximum of such entropies. But this maximum is precisely the entropy of a DS knowledge
as we defined it. So $\lim (Q(N)/N)$ equals the entropy of the DS knowledge. Q.E.D.

The estimates for the probabilistic case are proved likewise.

Proof of the statement from Section 4. We have to consider the case when $n = 2$
(there are two possible outcomes). In this case the result of $N$ events is a sequence of 1’s
and 2’s. A result is consistent with our knowledge if and only if the number $N_1$ of 1’s
satisfies the inequality $N/2 - k\sqrt{N}/2 \le N_1 \le N/2 + k\sqrt{N}/2$ (actually we must demand that
the same inequality holds for $N_2 = N - N_1$, but one can easily see that this second
inequality is equivalent to the first one). Let’s estimate the number $N_{\mathrm{cons}}$ of such results.

In order to get this estimate, let’s use the following trick. Suppose that we have $N$
independent identically distributed random variables $r_k$, each of which attains two possible
values 1 and 2 with equal probability 1/2. Then the probability of each of the $2^N$ possible
sequences of 1’s and 2’s is the same: $2^{-N}$. The probability $P$ that a random sequence
satisfies the above inequalities is equal to the sum of the probabilities of all the sequences
that satisfy it, i.e., is equal to the sum of $N_{\mathrm{cons}}$ terms that are equal to $2^{-N}$. So $P =
N_{\mathrm{cons}} 2^{-N}$. Therefore, if we manage to estimate $P$, we’ll be able to reconstruct $N_{\mathrm{cons}}$ by
using the formula $N_{\mathrm{cons}} = 2^N P$.

So let us estimate $P$. Let’s recall the arguments that lead to the inequalities that we
are using. The total number $N_1$ of 1’s in a sequence $\{r_k\}$ is equal to the sum of terms that
are equal to 1 if $r_k = 1$ and to 0 if $r_k = 2$. In other words, it is the sum of $2 - r_k$. So $N_1$ is
the sum of several independent identically distributed variables, and therefore for big $N$ its distribution is
close to Gaussian, with the average $a = N/2$ and the standard deviation $\sigma = \sqrt{N}/2$. Therefore
for big $N$ the probability that $N_1$ satisfies the above inequalities is equal to the probability
that the value of a Gaussian random variable with the average $a$ and standard deviation
$\sigma$ lies between $a - k\sigma$ and $a + k\sigma$. This probability $P$ depends only on $k$ and does not
depend on $N$ at all. For example, for $k = 2$, $P \approx 0.95$, and for bigger $k$, $P$ is bigger. Since
$N_{\mathrm{cons}} = P 2^N$, we conclude that $Q(N) \approx \log_2(P 2^N) = N - c$, where $c = -\log_2 P$. For
$k = 2$ we get $c = -\log_2 P \approx 0.1$, and for bigger $k$ it is even smaller.